Analysis of Research Productivity at LSE¶

Table of Contents¶

1) Introduction

2) Data Acquisition
2.1 LSE department staff
2.2 LSE research

3) Data Manipulation & Exploration
3.1 Loading the datasets
3.2 Initial exploration plot
3.3 Data manipulation

4) Data Analysis
4a. How does research productivity vary across departments?
4b. What are the factors affecting the average productivity?
      - Department Size
      - Research Staff Ratio
      - Professor Ratio
      - Dr Ratio
      - External Collaborator Ratio

5) Conclusion

1. Introduction ¶

LSE is renowned for its research publications, particularly as a university specialising in social science. The 2021 Research Excellence Framework (REF) recognised it as the top university in the UK based on the proportion of world-leading (four-star) research outputs: 58% of all research produced by the university was rated world-leading (https://www.lse.ac.uk/News/Latest-news-from-LSE/2022/e-May-22/REF-2021-results). It follows that LSE devotes a significant share of its funding to research. Currently, this is done by allocating £3,000 per full-time equivalent to each eligible member of staff, according to the 2023-24 Departmental Funding Guidelines.

However, a flat allocation may not be the most efficient way of distributing funds, as departments differ in their research potential. We therefore want to investigate research productivity across LSE departments. This information is useful not only for budgeting purposes but also to other parties. For example, it could help prospective PhD students who are deciding between similar departments, such as Sociology and Anthropology, and want to use research productivity as a deciding factor, or any student considering LSE as a place to study.

Apart from analysing the productivity differences, we also want to investigate potential root causes of any discrepancies. The factors we consider are the size of the department as measured by number of staff, the proportion of research staff, the proportions of Drs and Professors, and external collaboration.

We were unable to find any similar analysis of this topic; existing work seems to focus more on the quality of the research published by different LSE departments. As LSE prides itself on its research, we hope this analysis provides some insight into questions that haven’t been considered before.

We aim to answer the questions:

  1. How does research productivity vary across departments?
  2. What are the factors affecting average productivity?
    • Department Size
    • Research Staff Ratio
    • Professor Ratio
    • Dr Ratio
    • External Collaborator Ratio

Note:

Due to time limitations and the complexity of the data acquisition, we focus on 12 departments:
Social Policy, Anthropology, Finance, Mathematics, Statistics, Psychological and Behavioural Science, International Relations, Management, Sociology, Geography and Environment, Economic History, Government.

2. Data Acquisition¶

2.1 LSE department staff¶

To investigate specific departments, we require staff information for each department. This data is available in the staff section of each department's pages on the LSE website. To convert the data into a format that we could analyse and manipulate, we used webscraping. Because some departments have very different webpage and HTML structures, we first webscraped the departments with similar formats. As this yielded only a few departments, too few for a well-rounded analysis, we then webscraped several more, giving us a total of 12 departments.

As the process is tedious and repetitive, we omitted it from this document and provided a separate notebook for the staff data acquisition: Data Acquisition - Departmental Staff Data via Webscraping.ipynb.

We stored each staff member's department, name, label, and title (Dr, Professor) in a dataframe and chose to omit PhD students, since neither the research publications nor the funding relates to them. Then, based on their label, we identified each staff member as research based or non-research based and put that category in a new column.

More details on how we acquired and cleaned the data set and dealt with various problems can be found in the data acquisition notebook; the final prepared data set is in the Data folder as a csv file called departmental_staff_data.csv.

2.2 LSE research¶

To explore research productivity, we need information about LSE research, which is available in the LSE research database at https://eprints.lse.ac.uk/. While this data can be webscraped, it is also available in JSON format, which is semi-structured and therefore much easier to convert to a dataframe and manipulate, so we used the JSON files to extract the information. All the required JSON files can be found in the Data folder, along with the final csv file, departmental_publications_data.csv.

For each publication, we stored the title, the department, the date, the names of the authors, the number of authors, and the number of authors who are LSE staff (identified by whether they had an LSE Institute ID). Getting this data was much more straightforward than webscraping; nevertheless, we have omitted the process from this document and provided a separate notebook for the research data acquisition: Data Acquisition - Publications per Department via JSON.ipynb. More details on how we dealt with the data can be found in that notebook.

3. Data Manipulation¶

3.1 Loading the datasets¶

For the webscraped department staff information, we have already done some preliminary data manipulation inside the acquisition notebook, because some of the information obtained from webscraping was unnecessary or duplicated. More details can be found in Data Acquisition - Departmental Staff Data via Webscraping.ipynb.

Here, we will use the csv files stored from the two data acquisition notebooks. First of all, let's take a look at the structure of the datasets.

The staff dataset contains all the staff members of the 12 selected departments, with their name, department, label, title (Professor, Dr, or neither), and whether they are research or non-research staff. The last column, Category, is derived from the labels. We will use only the Category column, not the Label column, as there are too many unique labels. The code and details can be found in the notebook Data Acquisition - Departmental Staff Data via Webscraping.ipynb.

The publications dataset contains all the publications under the 12 selected departments that we obtained from the LSE research database website. For each publication it contains the title, department, publication date, author names, number of authors, and number of LSE authors.

In [1]:
import pandas as pd
import seaborn as sns
publications = pd.read_csv("Data/departmental_publications_data.csv")
staff = pd.read_csv("Data/departmental_staff_data.csv")
display(staff.tail())
display(publications.head())
Name Department Label Title Category
1165 Paul Willman Management Other academic and research staff Professor Research
1166 Mohamed Abouaziza Management Other academic and research staff Dr Research
1167 Anushri Gupta Management Other academic and research staff Dr Research
1168 Philipp Schoenegger Management Other academic and research staff Dr Research
1169 Oliver Seager Management Other academic and research staff NaN Research
Title Department Date Authors NumberOfAuthors NumberOfStaffAuthors
0 British incomes and property in the early nine... Economic History 01-12-1959 Patrick O'Brien 1 1
1 National assistance: service or charity? Social Policy 01-01-1962 Howard Glennerster 1 1
2 Twelve wasted years Social Policy 01-01-1963 Howard Glennerster 1 1
3 Public schools Social Policy 01-01-1964 Howard Glennerster 1 1
4 Man as tranducer for probabilities in Bayesian... Management 01-01-1964 W. Edwards, Lawrence D. Phillips 2 1

The date information we obtained from the JSON files is not very reliable, since some dates are incomplete. As the dataframe shows, if the exact day is missing in the JSON file, the date automatically defaults to the first day of that month.

Instead of the full date, we decided to use the more accurate year information. We extract the year from the Date column as follows:

In [2]:
publications['Year'] = publications['Date'].str[-4:].astype(int)
publications.head(3)
Out[2]:
Title Department Date Authors NumberOfAuthors NumberOfStaffAuthors Year
0 British incomes and property in the early nine... Economic History 01-12-1959 Patrick O'Brien 1 1 1959
1 National assistance: service or charity? Social Policy 01-01-1962 Howard Glennerster 1 1 1962
2 Twelve wasted years Social Policy 01-01-1963 Howard Glennerster 1 1 1963
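
As an alternative to slicing the last four characters, the year can also be obtained by parsing the dates explicitly. The following is only a sketch on a hypothetical three-row sample; it assumes that every value follows the day-month-year layout seen in the output above, and it is not the approach used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical sample in the same day-month-year layout as the Date column
sample = pd.DataFrame({"Date": ["01-12-1959", "01-01-1962", "15-03-2021"]})

# Parse with an explicit format so day and month are never swapped;
# errors="coerce" turns malformed strings into NaT instead of raising
parsed = pd.to_datetime(sample["Date"], format="%d-%m-%Y", errors="coerce")
sample["Year"] = parsed.dt.year

print(sample)
```

Parsing has the side benefit of flagging malformed entries (they become NaT), whereas string slicing would silently keep whatever the last four characters happen to be.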

3.2 Initial exploration plot¶

To measure productivity, we want to use the total number of publications divided by the total number of staff for each department.

However, directly using the full publication dataset can be problematic. Its oldest publication dates back to 1959, when not all departments had been established yet, and not all publications from that time may have been recorded.

In addition, the total number of staff in each department can vary across years. Ideally, therefore, we should focus only on publications from recent years, under the assumption that changes in each department's staff numbers over that period are negligible.

To determine the time period to focus on, we use a line plot to visualize publications over the decades and see from what point the publication counts become stable.

Preparing dataframe for visualization

To visualize the data, we have to reorganize the original dataset into a form suitable for plotting. We use the groupby function on the original dataframe to count the total publications for each department and each year, and fill department-year combinations without any publications with 0.

In [3]:
all_departments = publications['Department'].unique()
all_years = publications['Year'].unique()

DF=pd.DataFrame([(department, year) for department in all_departments for year in all_years],
                                columns=['Department', 'Year'])
TotalPub=publications.groupby(['Department', 'Year']).size().reset_index(name='Total Publications')
DF=pd.merge(DF, TotalPub, on=['Department', 'Year'], how='left').fillna(0)
DF=DF.sort_values(by=['Year','Department']).reset_index()
DF['Total Publications']=DF['Total Publications'].astype('int')

DF
Out[3]:
index Department Year Total Publications
0 384 Anthropology 1959 0
1 0 Economic History 1959 1
2 640 Finance 1959 0
3 512 Geography and Environment 1959 0
4 448 Government 1959 0
... ... ... ... ...
763 767 Mathematics 2024 17
764 639 Psychological and Behavioural Science 2024 44
765 127 Social Policy 2024 29
766 319 Sociology 2024 18
767 383 Statistics 2024 15

768 rows × 4 columns
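
The same department-by-year grid can be built without an explicit merge. The sketch below uses a hypothetical toy dataframe (`toy`) in place of `publications` and reindexes the grouped counts against every department-year combination; it is an alternative formulation under those assumptions, not the code that produced the table above.

```python
import pandas as pd

# Hypothetical toy data standing in for the publications dataframe
toy = pd.DataFrame({
    "Department": ["Finance", "Finance", "Statistics"],
    "Year": [2020, 2020, 2021],
})

# Count publications for each (Department, Year) pair that actually occurs
counts = toy.groupby(["Department", "Year"]).size()

# Build every Department x Year combination and fill the missing ones with 0
full_index = pd.MultiIndex.from_product(
    [sorted(toy["Department"].unique()), sorted(toy["Year"].unique())],
    names=["Department", "Year"],
)
grid = counts.reindex(full_index, fill_value=0).reset_index(name="Total Publications")

print(grid)
```

Because the missing combinations are filled with an integer 0 directly, the counts stay integers and no later `astype('int')` conversion is needed.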

Visualization line plot

We then visualize the total publications through an interactive plot. We chose an interactive plot because the range slider allows us to zoom in on the years we are most interested in and examine the changes and patterns closely.

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

fig = px.line(DF, x=DF.Year, y='Total Publications', color='Department',height=400)
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(title='Total Publications Over Time',title_x=0.2)
fig

Observations

As shown in the line plot, when we zoom in on more recent years, we can see a major drop for all departments in 2019, around the onset of the pandemic.

After the pandemic began, from 2020 onwards, the total publications become more stable. The final drop in 2024 occurs because the year has only just started and only works from the first three months are recorded.

Considering these patterns, we decided to use the data from the most recent 4 complete years, 2020-2023 (inclusive).

Using a shorter timeframe also helps keep the staff data relevant, as it is unlikely that a department changed its staff composition drastically enough in just the last few years to have a large impact on our analysis.

3.3 Data manipulation¶

After determining the time period we will be looking at (2020-2023), we proceed to investigate department-wise productivity differences and the factors affecting productivity.

This requires some further calculations and a reorganization of the datasets.

We currently have two datasets: publications and staff. We will first perform calculations on each dataset separately and then merge the results. For the publications dataset, we need the total publications and the collaboration ratio for each department between 2020 and 2023. For the staff dataset, we need to calculate, for each department, the total staff, the proportion of research staff, the proportion of Professors, and the proportion of Drs.

To perform these calculations while filtering the data, we believe it is more convenient to use SQL queries against a database than pandas dataframe operations. Therefore, we carry out the calculations and filtering through SQL below.

Publication Dataset

To use database and sql querying, we first need to convert the csv table to a db. This is done through sqlite3.

In [5]:
import sqlite3
conn = sqlite3.connect('Data/pubs.db')
publications.to_sql('pubs', conn, if_exists='replace', index=False)
conn.close()

To showcase the results more clearly, we chose to use SQL magic for querying. We loaded the sql extension and connected to the database.

In [6]:
%load_ext sql
In [7]:
%sql sqlite:///Data/pubs.db
Connecting to 'sqlite:///Data/pubs.db'

To make sure the table was transferred to the database successfully, we checked whether the total number of rows is correct and whether the structure of the table is consistent with the dataframe.

In [8]:
len(publications)
Out[8]:
32348
In [9]:
%sql SELECT COUNT(*) FROM pubs;
Running query in 'sqlite:///Data/pubs.db'
Out[9]:
COUNT(*)
32348
In [10]:
%sql SELECT * FROM pubs LIMIT 3;
Running query in 'sqlite:///Data/pubs.db'
Out[10]:
Title Department Date Authors NumberOfAuthors NumberOfStaffAuthors Year
British incomes and property in the early nineteenth century Economic History 01-12-1959 Patrick O'Brien 1 1 1959
National assistance: service or charity? Social Policy 01-01-1962 Howard Glennerster 1 1 1962
Twelve wasted years Social Policy 01-01-1963 Howard Glennerster 1 1 1963
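
The checks above can also be written as assertions, so that a failed transfer stops the notebook instead of relying on a visual comparison. This is a minimal sketch on a hypothetical toy dataframe and an in-memory database; the notebook itself writes the full `publications` dataframe to Data/pubs.db.

```python
import sqlite3
import pandas as pd

# Hypothetical toy frame; the notebook uses the full publications dataframe
toy = pd.DataFrame({
    "Title": ["Paper A", "Paper B"],
    "Department": ["Finance", "Statistics"],
    "Year": [2020, 2021],
})

# An in-memory database keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
toy.to_sql("pubs", conn, if_exists="replace", index=False)
back = pd.read_sql_query("SELECT * FROM pubs", conn)
conn.close()

# Row count, column names and values should all survive the round trip
assert len(back) == len(toy)
assert list(back.columns) == list(toy.columns)
assert back.values.tolist() == toy.values.tolist()
```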

We then use sql query to calculate the total publications and the ratio of the externally collaborated publications.

  • The WHERE clause filters the data to 2020-2023;
  • GROUP BY is used since we want the aggregated data for each department;
  • COUNT(*) gives the total number of publications;
  • To calculate the ratio of externally collaborated works, we used AVG(CASE WHEN NumberOfAuthors>NumberOfStaffAuthors THEN 1 ELSE 0 END). The CASE expression is 1 when the number of authors exceeds the number of staff authors, i.e. when at least one co-author comes from another institution, which means external collaboration; averaging these 0/1 values gives the share of such publications.
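
To see how the AVG(CASE WHEN ...) pattern behaves, the sketch below runs the same kind of query on five hypothetical rows in an in-memory SQLite database, using Python's built-in sqlite3 module rather than SQL magic. The rows are made up for illustration only.

```python
import sqlite3

# Hypothetical toy rows: (Department, NumberOfAuthors, NumberOfStaffAuthors)
rows = [
    ("Finance", 3, 1),     # two external co-authors -> collaboration
    ("Finance", 2, 2),     # all authors are staff   -> no collaboration
    ("Statistics", 1, 1),  # single staff author     -> no collaboration
    ("Statistics", 4, 2),  # external co-authors     -> collaboration
    ("Statistics", 2, 1),  # external co-author      -> collaboration
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pubs (Department TEXT, NumberOfAuthors INTEGER, NumberOfStaffAuthors INTEGER)"
)
conn.executemany("INSERT INTO pubs VALUES (?, ?, ?)", rows)

query = """
SELECT Department,
       COUNT(*) AS TotalPublications,
       ROUND(AVG(CASE WHEN NumberOfAuthors > NumberOfStaffAuthors THEN 1 ELSE 0 END), 2) AS CollabRatio
FROM pubs
GROUP BY Department
ORDER BY Department;
"""
collab = conn.execute(query).fetchall()
conn.close()

print(collab)
```

In this toy example one of the two Finance rows and two of the three Statistics rows are flagged as collaborations, so the ratios come out as one half and two thirds (rounded to two decimals).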
In [11]:
%config SqlMagic.displaylimit = 15
In [12]:
%%sql

SELECT Department,
        COUNT(*) AS TotalPublications,
        ROUND(AVG(CASE WHEN NumberOfAuthors>NumberOfStaffAuthors THEN 1 ELSE 0 END),2) AS CollabRatio 
FROM pubs
WHERE 2020<=Year AND Year<=2023
GROUP BY Department
ORDER BY 1,2;
Running query in 'sqlite:///Data/pubs.db'
Out[12]:
Department                             TotalPublications  CollabRatio
Anthropology                           205                0.39
Economic History                       173                0.48
Finance                                130                0.76
Geography and Environment              667                0.57
Government                             426                0.42
International Relations                378                0.35
Management                             428                0.71
Mathematics                            261                0.77
Psychological and Behavioural Science  579                0.55
Social Policy                          519                0.58
Sociology                              219                0.37
Statistics                             298                0.76

After obtaining the required information, we transformed the table into pandas dataframe, for later visualization.

In [13]:
%%sql result << 
SELECT Department,
        COUNT(*) AS TotalPublications,
        ROUND(AVG(CASE WHEN NumberOfAuthors>NumberOfStaffAuthors THEN 1 ELSE 0 END),2) AS CollabRatio 
FROM pubs
WHERE 2020<=Year AND Year<=2023
GROUP BY Department
ORDER BY 1,2;
Running query in 'sqlite:///Data/pubs.db'
In [14]:
merge1=result.DataFrame()
merge1.head()
Out[14]:
Department TotalPublications CollabRatio
0 Anthropology 205 0.39
1 Economic History 173 0.48
2 Finance 130 0.76
3 Geography and Environment 667 0.57
4 Government 426 0.42
In [15]:
%sql --close sqlite:///Data/pubs.db

Staff dataset

After manipulating the first publication dataset, we then turn to the second staff dataset. The procedure is almost the same as the first publication dataset. We first need to convert the dataframe to a db and then use sql magic for querying.

In [16]:
conn = sqlite3.connect('Data/staff.db')
staff.to_sql('staff', conn, if_exists='replace', index=False)
conn.close()
In [17]:
%sql sqlite:///Data/staff.db
Connecting and switching to connection 'sqlite:///Data/staff.db'

We then check if the dataset is successfully and completely transferred into db, by checking the number of rows, and the structure of the table.

In [18]:
len(staff)
Out[18]:
1170
In [19]:
%sql SELECT COUNT(*) FROM staff;
Running query in 'sqlite:///Data/staff.db'
Out[19]:
COUNT(*)
1170
In [20]:
%sql SELECT * FROM staff LIMIT 3;
Running query in 'sqlite:///Data/staff.db'
Out[20]:
Name Department Label Title Category
Fabio Battaglia Social Policy Academic staff Dr Research
Liam Beiser-McGrath Social Policy Academic staff Dr Research
Thomas Biegert Social Policy Academic staff Dr Research

After the successful transfer, we perform the calculations through SQL queries. For this dataset, we need to calculate the total staff, the proportion of research staff, the Professor ratio, and the Dr ratio.

  • The GROUP BY clause is used since we want the aggregated data for each department;
  • COUNT(*) gives the total number of staff;
  • AVG(CASE WHEN ... THEN 1 ELSE 0 END) calculates the proportion of rows that satisfy the condition after WHEN. Staff with a missing Title fall into the ELSE branch and count as 0.
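
For comparison, the same per-department ratios can be sketched in pandas, where the mean of a boolean column plays the role of AVG(CASE WHEN ...). The four-row `toy_staff` table below is hypothetical and only illustrates the pattern, including how a missing title is handled; the notebook's own calculation is the SQL query that follows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy staff table; a missing Title mimics staff with neither title
toy_staff = pd.DataFrame({
    "Department": ["Management"] * 4,
    "Title": ["Professor", "Dr", "Dr", np.nan],
    "Category": ["Research", "Research", "Non-Research", "Research"],
})

# The mean of a boolean indicator is the share of rows meeting the condition,
# which is what AVG(CASE WHEN ... THEN 1 ELSE 0 END) computes in SQL
ratios = (
    toy_staff.assign(
        IsResearch=toy_staff["Category"].eq("Research"),
        IsDr=toy_staff["Title"].eq("Dr"),
        IsProf=toy_staff["Title"].eq("Professor"),
    )
    .groupby("Department")
    .agg(
        TotalStaff=("Category", "size"),
        ResearchRatio=("IsResearch", "mean"),
        DrRatio=("IsDr", "mean"),
        ProfRatio=("IsProf", "mean"),
    )
    .round(2)
    .reset_index()
)

print(ratios)
```

The row with a missing title compares as False for both "Dr" and "Professor", so it still counts in the denominator, matching the ELSE 0 behaviour of the SQL version.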
In [21]:
%%sql

SELECT Department, 
        COUNT(*) AS TotalStaff,
        ROUND(AVG(CASE WHEN Category='Research' THEN 1 ELSE 0 END),2) AS ResearchRatio, 
        ROUND(AVG(CASE WHEN Title='Dr' THEN 1 ELSE 0 END),2) AS DrRatio, 
        ROUND(AVG(CASE WHEN Title='Professor' THEN 1 ELSE 0 END),2) AS ProfRatio
FROM staff
GROUP BY Department;
Running query in 'sqlite:///Data/staff.db'
Out[21]:
Department                             TotalStaff  ResearchRatio  DrRatio  ProfRatio
Anthropology                           62          0.9            0.6      0.27
Economic History                       66          0.65           0.29     0.35
Finance                                95          0.41           0.31     0.18
Geography and Environment              128         0.59           0.23     0.18
Government                             157         0.68           0.38     0.27
International Relations                100         0.75           0.48     0.21
Management                             165         0.64           0.39     0.22
Mathematics                            83          0.59           0.27     0.3
Psychological and Behavioural Science  108         0.76           0.52     0.17
Social Policy                          76          0.75           0.46     0.29
Sociology                              70          0.79           0.61     0.24
Statistics                             60          0.87           0.55     0.27

We then transformed the result to pandas dataframe, for later usage.

In [22]:
%%sql result << 

SELECT Department, 
        COUNT(*) AS TotalStaff,
        ROUND(AVG(CASE WHEN Category='Research' THEN 1 ELSE 0 END),2) AS ResearchRatio, 
        ROUND(AVG(CASE WHEN Title='Dr' THEN 1 ELSE 0 END),2) AS DrRatio, 
        ROUND(AVG(CASE WHEN Title='Professor' THEN 1 ELSE 0 END),2) AS ProfRatio
FROM staff
GROUP BY Department;
Running query in 'sqlite:///Data/staff.db'
In [23]:
merge2=result.DataFrame()
merge2.head()
Out[23]:
Department TotalStaff ResearchRatio DrRatio ProfRatio
0 Anthropology 62 0.90 0.60 0.27
1 Economic History 66 0.65 0.29 0.35
2 Finance 95 0.41 0.31 0.18
3 Geography and Environment 128 0.59 0.23 0.18
4 Government 157 0.68 0.38 0.27
In [24]:
%sql --close sqlite:///Data/staff.db

Merging tables

After the separate calculations, we merge the two tables into one, which makes it easier to visualize the data and conduct the investigation later.

The merged table below has everything we need to investigate the differences in department-wise productivity and the reasons for the variation.

With all the necessary manipulation done, we then move on to the analysis in Section 4.

In [25]:
DF2=pd.merge(merge1,merge2,how='left',on='Department')
DF2['AverageProductivity']=DF2['TotalPublications']/DF2['TotalStaff']
DF2['AverageProductivity']=DF2['AverageProductivity'].round(2)
DF2=DF2[['Department','TotalPublications','AverageProductivity','TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio']]
DF2
Out[25]:
    Department                             TotalPublications  AverageProductivity  TotalStaff  ResearchRatio  ProfRatio  DrRatio  CollabRatio
0   Anthropology                           205                3.31                 62          0.90           0.27       0.60     0.39
1   Economic History                       173                2.62                 66          0.65           0.35       0.29     0.48
2   Finance                                130                1.37                 95          0.41           0.18       0.31     0.76
3   Geography and Environment              667                5.21                 128         0.59           0.18       0.23     0.57
4   Government                             426                2.71                 157         0.68           0.27       0.38     0.42
5   International Relations                378                3.78                 100         0.75           0.21       0.48     0.35
6   Management                             428                2.59                 165         0.64           0.22       0.39     0.71
7   Mathematics                            261                3.14                 83          0.59           0.30       0.27     0.77
8   Psychological and Behavioural Science  579                5.36                 108         0.76           0.17       0.52     0.55
9   Social Policy                          519                6.83                 76          0.75           0.29       0.46     0.58
10  Sociology                              219                3.13                 70          0.79           0.24       0.61     0.37
11  Statistics                             298                4.97                 60          0.87           0.27       0.55     0.76
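
A merge like this silently produces NaN values if a department name is spelled differently in the two tables. The sketch below shows one way to guard against that with the `validate` and `indicator` options of `pd.merge`. It uses only the first three departments (values copied from the tables above), and the names `left`, `right` and `checked` are placeholders rather than variables from the notebook.

```python
import pandas as pd

# Values copied from the two per-department summary tables (first three rows)
left = pd.DataFrame({
    "Department": ["Anthropology", "Economic History", "Finance"],
    "TotalPublications": [205, 173, 130],
})
right = pd.DataFrame({
    "Department": ["Anthropology", "Economic History", "Finance"],
    "TotalStaff": [62, 66, 95],
})

# validate raises MergeError if a department appears twice on either side;
# indicator adds a _merge column showing where each row was found
checked = pd.merge(left, right, how="left", on="Department",
                   validate="one_to_one", indicator=True)

# Every department should be present in both tables
assert (checked["_merge"] == "both").all()

checked["AverageProductivity"] = (
    checked["TotalPublications"] / checked["TotalStaff"]
).round(2)
print(checked.drop(columns="_merge"))
```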

4. Data Analysis¶

4a: How does research productivity vary across departments?¶

Based on the exploration and plots in the previous section, we found that some departments consistently have a high number of publications, for instance the Department of Geography and Environment. This may be because those departments are larger and have more staff.

Therefore, instead of looking at the overall publication numbers of each department, which are affected by department size, we focus on productivity. We measure productivity as the total number of publications from 2020 to 2023 divided by the total number of department staff (assuming there were no major changes in staff numbers over these years).

The tables below show the top five departments ordered by total publications and by average productivity, respectively.

In [26]:
display(DF2[['Department','TotalPublications']].sort_values(by='TotalPublications',ascending=False).head())
display(DF2[['Department','AverageProductivity']].sort_values(by='AverageProductivity',ascending=False).head())
Department TotalPublications
3 Geography and Environment 667
8 Psychological and Behavioural Science 579
9 Social Policy 519
6 Management 428
4 Government 426
Department AverageProductivity
9 Social Policy 6.83
8 Psychological and Behavioural Science 5.36
3 Geography and Environment 5.21
11 Statistics 4.97
5 International Relations 3.78

We can see that although Geography and Environment has the most publications, it is not the most productive department. Conversely, although Statistics is not among the top 5 departments by publications, it ranks fourth among the 12 departments in terms of average publications per staff member over the 4 years.

Departments with many publications sometimes also have competitive productivity, but a high publication count does not guarantee high productivity, as shown by the relationship plot below.

In [27]:
from scipy.stats import linregress
import matplotlib.pyplot as plt

DF2=DF2.sort_index()
fig, ax = plt.subplots(figsize=(15, 5))

ax.scatter(DF2['TotalPublications'], DF2['AverageProductivity'], color='blue', alpha=0.5)
# The x-axis is total publications (not department size), so the title says so
ax.set_title('Total Publications vs. Average Productivity by Department',fontsize=20,y=1.05)
ax.set_xlabel('Total Publications',fontsize=15)
ax.set_ylabel('Average Productivity',fontsize=15)

# Fit and draw a simple least-squares regression line
result = linregress(DF2['TotalPublications'], DF2['AverageProductivity'])
slope = result.slope
intercept = result.intercept
ax.plot(DF2['TotalPublications'], slope * DF2['TotalPublications'] + intercept, color='blue')

# Annotate each point with the department name
for txt, x, y in zip(DF2['Department'], DF2['TotalPublications'], DF2['AverageProductivity']):
    # Place these three labels above their points to avoid overlapping annotations
    if txt in ['Anthropology', 'Government', 'Psychological and Behavioural Science']:
        offset = (0, 3)
    else:
        offset = (5, -5)
    ax.annotate(txt, (x, y), xytext=offset, textcoords='offset points', fontsize=11)

plt.show()
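
To put a number on the relationship between total publications and average productivity, one option is to compute correlation coefficients alongside the plot. The sketch below hard-codes the twelve values from the merged table in Section 3.3 and is meant as a rough check only: with 12 departments, any p-value should be read with caution.

```python
from scipy.stats import pearsonr, spearmanr

# Values copied from the merged table, in the same department order
total_publications = [205, 173, 130, 667, 426, 378, 428, 261, 579, 519, 219, 298]
average_productivity = [3.31, 2.62, 1.37, 5.21, 2.71, 3.78,
                        2.59, 3.14, 5.36, 6.83, 3.13, 4.97]

# Pearson measures linear association; Spearman compares the two rankings
pearson_r, pearson_p = pearsonr(total_publications, average_productivity)
spearman_rho, spearman_p = spearmanr(total_publications, average_productivity)

print(f"Pearson r = {pearson_r:.2f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.2f} (p = {spearman_p:.3f})")
```

Since average productivity is itself total publications divided by staff, some positive association is expected by construction; the interesting part is how far individual departments deviate from it.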

Looking at individual departments, we can see that although Anthropology, Sociology, and Mathematics have similar productivity, they do not have similar total publications: Mathematics has about 1.3 times as many publications as Anthropology (261 vs. 205).

Turning to department-wise productivity, there is indeed quite a lot of variation, from nearly 7 publications per staff member in the most productive department to only about 1.4 in the least productive one over 2020-2023.

The productivity variation is shown in the boxplot below.

In [28]:
plt.figure(figsize=(8, 3))

sns.boxplot(data=DF2, x='AverageProductivity', color='lightblue')
plt.title('Boxplot Distribution of Average Productivity')

max_value = DF2['AverageProductivity'].max()
min_value = DF2['AverageProductivity'].min()
plt.annotate(f'Max: {max_value}', xy=(max_value, 0), xytext=(max_value-0.7, 0.1))
plt.annotate(f'Min: {min_value}', xy=(min_value, 0), xytext=(min_value+0.05, 0.1))

plt.show()
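
The numbers behind a boxplot of this kind can also be computed directly. The sketch below hard-codes the twelve AverageProductivity values from the merged table and assumes the common 1.5 × IQR whisker rule; quartile conventions differ slightly between libraries, so the values may not match the drawn box exactly.

```python
import numpy as np

# AverageProductivity values copied from the merged table (12 departments)
productivity = np.array([3.31, 2.62, 1.37, 5.21, 2.71, 3.78,
                         2.59, 3.14, 5.36, 6.83, 3.13, 4.97])

# Quartiles use NumPy's default linear interpolation
q1, median, q3 = np.percentile(productivity, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the quartiles;
# values outside that range would be drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = productivity[(productivity < lower_fence) | (productivity > upper_fence)]

print(f"min={productivity.min()}, Q1={q1:.2f}, median={median:.2f}, "
      f"Q3={q3:.2f}, max={productivity.max()}")
print("outliers:", outliers)
```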
In [29]:
DF2.sort_values(by='AverageProductivity',inplace=True)

palette = sns.color_palette("cividis", len(DF2))
bars = plt.barh(y=DF2.Department, width=DF2['AverageProductivity'], alpha=0.6, color=palette)
# Label each bar with its productivity value
for bar in bars:
    plt.text(bar.get_width(), bar.get_y() + bar.get_height()/2, f'{bar.get_width():.2f}', ha='left', va='center')
plt.title('Productivity Differences Across LSE Departments',y=1.05)

plt.show()

Exploring the department productivity differences, we can see that Social Policy is the most productive department, and the following three have similar productivity of around 5 publications per staff member: Psychological and Behavioural Science, Geography and Environment, and Statistics.

The rest of the departments perform similarly in terms of productivity, at roughly 2.6 to 3.8 publications per staff member, with the exception of Finance, which drops to 1.37. One possible explanation is the nature of the field: Finance is more of an applied subject, more closely tied to real-world applications than to academia.

4b: What are the factors that contribute to making an academic department more productive?¶

The analysis in 4a indicates significant variations in average productivity across different departments. In this part, we aim to delve deeper into the specific factors that could account for these differences, that is, the specific factors that contribute to making an academic department more productive.
Combined with the data we have, we have identified 5 possible factors that may contribute to the departmental overall productivity:
      - Department Size
      - Research Staff Ratio
      - Professor Ratio
      - Dr Ratio
      - External Collaborator Ratio

So, initially, we'll examine how these factors manifest within each department.
We will continue to utilize data from 2020 to 2023 (inclusive) for our analysis to maintain consistency.

4b.1 The Behavior of 5 Factors under Each Department¶

In [30]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1, 5, figsize=(40, 18))

# Subplot titles
titles = ['TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio']

# Plot bar charts in each subplot
for i, (title, ax) in enumerate(zip(titles, axs)):
    # Sort by the specified column in ascending order, so the largest value ends up as the top bar
    sorted_DF2 = DF2.sort_values(by=title, ascending=True)
    
    # Calculate the mean value
    mean_value = sorted_DF2[title].mean()
    
    # Plot bars and set colors based on whether the value is above or below the mean
    for department, value in zip(sorted_DF2['Department'], sorted_DF2[title]):
        if value > mean_value:
            color = 'darkblue'
        else:
            color = 'lightblue'
        bar = ax.barh(y=department, width=value, color=color, alpha=0.6)
        ax.text(value, bar[0].get_y() + bar[0].get_height() / 2, f'{value:.2f}', ha='left', va='center', fontsize=12)
    
    # Draw a dashed line for the mean value
    ax.axvline(x=mean_value, color='yellow', linestyle='--', linewidth=5)
    
    # Set x-axis limits
    if i == 0: 
        ax.set_xlim(0, 200)
    else:
        ax.set_xlim(0, 1)
    
    ax.set_title(title, fontsize=33)
    ax.set_xlabel('')  # Hide x-axis label
    ax.tick_params(axis='both', labelsize=20) 
    ax.spines['top'].set_visible(False)  # Hide top border
    ax.spines['right'].set_visible(False)  # Hide right border
    ax.spines['left'].set_visible(False)  # Hide left border

plt.subplots_adjust(top=0.88, bottom=0.1, left=0.1, right=0.95, wspace=0.3)
plt.suptitle("Rankings of Department under Each Factor", fontsize=40)

plt.show()

In the above plots, it is straightforward to pinpoint departments that stand out from the majority. Additionally, we can compare different indicators within the same department. For instance, the Department of Management has the highest TotalStaff count, yet its ResearchRatio, ProfRatio, and DrRatio are all slightly below the respective averages; only its CollabRatio is clearly above average. This observation might help explain why its overall productivity is relatively low.

Regarding the ratios, we observe that the average ResearchRatio is notably higher than the average ProfRatio and DrRatio, and that the average ProfRatio is the lowest of the three. This aligns with the common understanding that the title of Professor is typically harder to attain.

Next, we'll proceed to estimate the effect of each factor on departmental overall productivity.

4b.2 The Effect of Each Factor on Departmental Overall Productivity¶

Using Pairplot and Heat Map allows us to examine the pairwise relationships and correlations, providing a rough understanding of the effect of each factor on departmental overall productivity.

Pairplot

In [31]:
import warnings
warnings.filterwarnings('ignore')

# Adjust the order of the variables so that our dependent variable "AverageProductivity" is on the y-axis in the last row of the grid
ax=sns.pairplot(DF2[['TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio','AverageProductivity']],
             kind='reg', diag_kind='kde',corner=True)
ax.figure.set_size_inches(18,8)

Heat Map

In [32]:
import numpy as np

corrMatrix=DF2[['AverageProductivity','TotalStaff','ResearchRatio','ProfRatio','DrRatio','CollabRatio']].corr().round(2)
mask = np.triu(np.ones_like(corrMatrix, dtype=bool))[1:,:-1]
sns.heatmap(corrMatrix.iloc[1:,:-1], mask=mask, vmin=-1, vmax=1, center=0, cmap='coolwarm', linewidths=.5,
            annot=True, square=True, annot_kws={"fontsize":8}, cbar_kws={"shrink":.8})
plt.xticks(rotation=30, ha='right');

We observe positive correlations between departmental overall productivity and ResearchRatio as well as DrRatio, whereas Department Size (TotalStaff), ProfRatio, and CollabRatio exhibit negative correlations. In particular, the correlations of productivity with ProfRatio and with CollabRatio are negative but very weak (close to zero).

We also observe a strongly positive correlation between DrRatio and ResearchRatio. This correlation is likely attributable to the fact that staff with a "Dr" title constitute the majority of the research staff within each department.
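
The heat map shows only the correlation coefficients. As a rough complement, the sketch below computes Pearson's r together with a p-value for each factor, using the values from the merged table in Section 3.3 hard-coded into a dictionary (`data` and `results` are placeholder names). With only 12 departments these p-values are indicative at best, and no correction for multiple comparisons is applied.

```python
from scipy.stats import pearsonr

# Values copied from the merged table, in the same department order
data = {
    "AverageProductivity": [3.31, 2.62, 1.37, 5.21, 2.71, 3.78,
                            2.59, 3.14, 5.36, 6.83, 3.13, 4.97],
    "TotalStaff": [62, 66, 95, 128, 157, 100, 165, 83, 108, 76, 70, 60],
    "ResearchRatio": [0.90, 0.65, 0.41, 0.59, 0.68, 0.75,
                      0.64, 0.59, 0.76, 0.75, 0.79, 0.87],
    "ProfRatio": [0.27, 0.35, 0.18, 0.18, 0.27, 0.21,
                  0.22, 0.30, 0.17, 0.29, 0.24, 0.27],
    "DrRatio": [0.60, 0.29, 0.31, 0.23, 0.38, 0.48,
                0.39, 0.27, 0.52, 0.46, 0.61, 0.55],
    "CollabRatio": [0.39, 0.48, 0.76, 0.57, 0.42, 0.35,
                    0.71, 0.77, 0.55, 0.58, 0.37, 0.76],
}

factors = ["TotalStaff", "ResearchRatio", "ProfRatio", "DrRatio", "CollabRatio"]

# Correlate each factor with average productivity and keep (r, p) pairs
results = {}
for factor in factors:
    r, p = pearsonr(data[factor], data["AverageProductivity"])
    results[factor] = (r, p)
    print(f"{factor:>14}: r = {r:+.2f}, p = {p:.3f}")
```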

4b.3 Evaluating the Effect of Each Factor

(1) The Effect of Department Size on Productivity

We observe a slightly negative relationship between Departmental Overall Productivity and Department Size. This contrasts with our hypothesis that a larger department is more extensive and comprehensive, with higher management standards and more diverse research fields, which could enhance the research productivity of staff, especially research staff, and thereby increase departmental overall productivity.

There are two main potential reasons why our observed results deviate from our initial hypothesis.

Reason 1: Issues with our methodology for calculating average productivity

Department size positively influences total publications, but productivity per capita tends to decrease as the department grows. Our dependent variable is TotalPublications divided by department size, so if total publications rise less than proportionally with department size, average productivity will naturally be negatively correlated with department size.
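The arithmetic behind Reason 1 can be illustrated with a small sketch. The functional form, the parameter names `scale` and `elasticity`, and all numbers are assumptions chosen for illustration and are not estimated from the LSE data. The point is only that when total output grows less than proportionally with staff, output per staff member falls.

```python
# Hypothetical illustration (made-up numbers, not LSE data):
# total publications rise with staff, but less than proportionally.
def total_publications(staff, scale=20.0, elasticity=0.7):
    """Assumed relationship: publications = scale * staff ** elasticity."""
    return scale * staff ** elasticity

for staff in (40, 80, 120):
    total = total_publications(staff)
    # Average productivity is total publications divided by staff count.
    print(f"staff={staff:>3}  total={total:7.1f}  per_staff={total / staff:.2f}")
```

Any `elasticity` below 1 produces the same pattern: the total column rises while the per-staff column falls.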

Reason 2: The existence of confounders

As department size increases, the research staff ratio tends to decrease, because larger departments have a greater number of non-research staff. A lower research staff ratio in turn leads to a decline in departmental overall productivity.

If our hypothesis holds, this suggests that the research staff ratio has the stronger impact on departmental overall productivity, while the positive effect of department size is relatively weak and partially offset.

If we want to further investigate the influence of department size on departmental overall productivity, we need to control for research staff ratio.

In the data, the Department of Geography and Environment and the Department of Mathematics have the same research staff ratio (0.59). However, Geography and Environment is considerably larger than Mathematics (128 versus 83 staff), and its average productivity is also considerably higher (5.21 versus 3.14).

In [33]:
display(DF2.loc[[3, 7], ['Department', 'AverageProductivity', 'TotalStaff', 'ResearchRatio']])
   Department                 AverageProductivity  TotalStaff  ResearchRatio
3  Geography and Environment                 5.21         128           0.59
7  Mathematics                               3.14          83           0.59
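
Comparing two departments with equal research staff ratios is an informal way of controlling for that ratio. A more systematic version of the same idea is a partial correlation, sketched below. The helper `partial_correlation` and the input values are hypothetical (the numbers are not the LSE data), and the formula assumes linear relationships between the variables.

```python
import numpy as np

def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r = np.corrcoef([x, y, z])  # rows are treated as variables
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Illustrative values only (NOT the LSE data): 12 hypothetical departments.
total_staff = [30, 45, 52, 60, 68, 75, 83, 90, 101, 110, 128, 140]
research_ratio = [0.80, 0.74, 0.70, 0.66, 0.62, 0.60, 0.59, 0.55, 0.50, 0.48, 0.59, 0.40]
avg_productivity = [5.9, 5.1, 5.6, 4.4, 4.9, 4.0, 3.1, 4.2, 3.3, 3.6, 5.2, 2.9]

raw = np.corrcoef(total_staff, avg_productivity)[0, 1]
controlled = partial_correlation(total_staff, avg_productivity, research_ratio)
print(f"raw correlation:     {raw:.2f}")
print(f"partial correlation: {controlled:.2f}")
```

If the partial correlation is much closer to zero (or changes sign) compared with the raw correlation, that would be consistent with the research staff ratio acting as a confounder.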

(2) The Effect of Departmental Research Staff Ratio on Productivity

We observe a positive and significant relationship between Departmental Overall Productivity and Research Staff Ratio, aligning with our expectations.

Common sense suggests that research productivity is primarily driven by research staff who focus on research rather than teaching or administrative tasks. Therefore, the presence of research staff significantly influences a department's overall productivity. Consequently, departments with a higher ratio of research staff are likely to exhibit greater research productivity.

Beyond this direct effect, we also consider spillover effects. A department with a higher research staff ratio likely places a greater emphasis on research. This focus could inspire non-research staff to become more involved in research activities, which would further raise overall departmental productivity.

(3) The Effect of Departmental Professor Ratio on Productivity

We observe an insignificant relationship between Departmental Overall Productivity and ProfRatio, which aligns with our initial hypothesis that the proportion of professors may not affect the departmental overall productivity significantly since professors are not the only personnel who are engaged in research activities.

(4) The Effect of Departmental Dr Ratio on Productivity

We observe a positive and relatively significant relationship between Departmental Overall Productivity and DrRatio. A possible explanation is that staff with the title Dr make up a large proportion of those capable of conducting research within each department and therefore have a larger impact on average productivity, whereas professors form only a small part and have a minor impact.

(5) The Effect of Departmental External Collaborator Ratio on Productivity

We observe a negative but insignificant relationship between Departmental Overall Productivity and CollabRatio, which slightly contradicts our initial expectation. We speculated that having more external collaborators in a department would foster closer ties to academia, potentially stimulating internal productivity. However, the data do not support this assumption, suggesting that external collaboration may not significantly affect overall productivity. This discrepancy could be attributed to various confounders. For instance, it can be argued that publishing research is more difficult in the field of Mathematics: despite having the highest CollabRatio, the Department of Mathematics performs below average in terms of productivity.

In [34]:
display(DF2.loc[[7], ['Department', 'AverageProductivity', 'CollabRatio']])
means = DF2[['AverageProductivity', 'CollabRatio']].mean()
print(means)
   Department   AverageProductivity  CollabRatio
7  Mathematics                 3.14         0.77
AverageProductivity    3.751667
CollabRatio            0.559167
dtype: float64

5. Conclusion

This project aimed to conduct an in-depth analysis of research productivity at LSE across various departments. Using data primarily from 2020-2023, together with data acquisition, cleaning, analysis, and visualisation skills, we explored potential factors contributing to departmental productivity and tried to understand the dynamics shaping these variations.

Our report found that the departments with the highest number of publications during the chosen time period (Geography and Environment, Psychological and Behavioural Science, and Social Policy) do not necessarily rank highest in average productivity, measured as the average number of publications per staff member. In average productivity, the Department of Social Policy ranks highest with 6.83 papers per staff member, while the Department of Finance ranks lowest with 1.37, indicating considerable variation between departments.

We might guess that this is due to the nature of the subjects themselves. However, further investigation into individual factors reveals that the proportion of research staff is positively correlated with average productivity, with a correlation coefficient of 0.46. Other factors seem to have negligible effects on average productivity.

Based on this, LSE could consider allocating more funding to the Department of Social Policy, as it has the highest research productivity per staff member, and should in future allocate research funding based not only on the number of staff members but also on their productivity. To promote productivity, our analysis suggests that LSE should focus on increasing the number of research-based staff members, particularly those with the title Dr, rather than simply on staff size or collaboration.

Limitations

While we do reach a conclusion, there are several limitations of our analysis to consider, most stemming from imperfect data availability and the inferences we had to make from incomplete data. First, we noticed while gathering data that many of the professors in fact hold doctorates, so the distinction between Professor and Dr is not very clear and may simply reflect an arbitrary preferred title. Second, we only used data for 12 departments, while LSE has 27 departments. Third, we assumed that there have been no major staff changes in the past four years, which may not hold for every department, particularly those that recently introduced new programmes. Fourth, our categorisation of staff members as research or non-research based could easily have misclassified a few staff members. Finally, we measure productivity by the number of research outputs published, as this is easy to rank and perform calculations on; this fails to consider that quality can be just as important a factor in productivity.

Further Analysis

For further analysis, we would recommend first obtaining more accurate data and collecting data from more departments, ideally all departments within LSE. It might also be worthwhile to obtain staff information from previous years to make the analysis as accurate as possible. Another interesting question is how LSE compares with other universities, and whether the same factors affect their average productivity per department to the same extent.

As we are considering multiple factors when it comes to research productivity, multiple regression would be a helpful tool. For example, the econometric model $AverageProductivity = \alpha + \beta \times TotalStaff + \gamma \times ResearchRatio + \varepsilon$ could be used to address the problem stated in 4b.3(1).
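
A minimal sketch of how such a model could be estimated by ordinary least squares is shown below. It was not run as part of this analysis: the helper `fit_ols` is hypothetical, and the input values are made up for illustration rather than taken from DF2.

```python
import numpy as np

def fit_ols(total_staff, research_ratio, avg_productivity):
    """Least-squares estimates (alpha, beta, gamma) for
    AverageProductivity = alpha + beta * TotalStaff + gamma * ResearchRatio."""
    total_staff = np.asarray(total_staff, dtype=float)
    research_ratio = np.asarray(research_ratio, dtype=float)
    y = np.asarray(avg_productivity, dtype=float)
    # Design matrix: intercept column, then the two regressors.
    X = np.column_stack([np.ones_like(total_staff), total_staff, research_ratio])
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients

# Illustrative values only (NOT the LSE data): 12 hypothetical departments.
total_staff = [30, 45, 52, 60, 68, 75, 83, 90, 101, 110, 128, 140]
research_ratio = [0.80, 0.74, 0.70, 0.66, 0.62, 0.60, 0.59, 0.55, 0.50, 0.48, 0.59, 0.40]
avg_productivity = [5.9, 5.1, 5.6, 4.4, 4.9, 4.0, 3.1, 4.2, 3.3, 3.6, 5.2, 2.9]

alpha, beta, gamma = fit_ols(total_staff, research_ratio, avg_productivity)
print(f"alpha={alpha:.3f}  beta={beta:.4f}  gamma={gamma:.3f}")
```

With the notebook's data, the inputs would be `DF2['TotalStaff']`, `DF2['ResearchRatio']` and `DF2['AverageProductivity']`. With only 12 observations the estimates would be imprecise, so a library such as statsmodels, which also reports standard errors, may be a better fit for drawing conclusions.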

References

LSE Staff Information per Department
Anthropology: https://www.lse.ac.uk/anthropology/people
Economic History: https://www.lse.ac.uk/Economic-History/People
Finance: https://www.lse.ac.uk/finance/people
Geography and Environment: https://www.lse.ac.uk/geography-and-environment/our-people
Government: https://www.lse.ac.uk/government/people
International Relations: https://www.lse.ac.uk/international-relations/people
Management: https://www.lse.ac.uk/management/people-home
Mathematics: https://www.lse.ac.uk/Mathematics/people
Psychological and Behavioural Science: https://www.lse.ac.uk/pbs/people
Social Policy: https://www.lse.ac.uk/social-policy/people
Sociology: https://www.lse.ac.uk/sociology/people
Statistics: https://www.lse.ac.uk/statistics/people

LSE Research Publications
https://eprints.lse.ac.uk/